Summary: Over the course of this project, the team worked on several fronts to tackle
problems related to evaluating and validating the machine learning techniques that were being
developed to classify content shared via social media. We began by focusing on text-based
content. Here we collected data in the form of social media users'
comments on news articles about hydrofracking (or fracking). This topic was chosen because it
is a timely social and political issue. Nearly 300 posts were analyzed with automated linguistic
inquiry  tools  and  results  show  that  systematic  patterns  in  language  use  were  predicted  by 
geographic  location  of  users.  Next,  we  expanded  our  efforts  to  address  image-based  content 
shared via social media. We opted to collect all images shared via Twitter associated with the
active conversation on the topic of gun control (#guncontrol). Again, this topic was chosen
given its central position in the public sphere concerning social and policy issues in the United
States. Our team first wrote a custom Python script to track and collect all media, specifically
images, shared over the course of one month on Twitter via the conversation identified by
#guncontrol.  We  simultaneously  developed  a  human  image  coding  protocol  so  that  shared 
images  could  be  classified  along  dimensions  including  frame,  appeal,  and  valence.  This 
classification  protocol  was  designed  so  that  it  could  be  used  by  the  machine  learning  algorithms. 
Next, we trained human coders and classified all images. We were able to identify which image
attributes predicted diffusion and propagation across Twitter. However, one problem was the
difficulty of obtaining consistent results from human coders, because human coders vary in
terms of their positions on political and social issues like gun control. Results indicate that
shared  images  with  attribute  frames,  fear  and  humor  appeals,  and  positive  valence  are 
retweeted  more  often.  Also  retweeted  more  frequently  are  messages  from  users  with  larger 
networks  and  whose  tweets  contain  hashtags.  Results  also  show  a  significant  negative 
relationship  between  the  time  since  the  last  major  shooting  event  in  the  United  States  and  the 
likelihood  that  messages  with  images  are  retweeted.  These  results  are  meaningful  considering 
the  context  of  evolving  mass  media  systems  and  online  social  networks. 
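The consistency problem among human coders noted above is commonly quantified with a chance-corrected agreement statistic. The report does not say which statistic, if any, was used; the sketch below assumes Cohen's kappa for two coders, and the function name and the valence codes are hypothetical illustrations rather than project data.

```python
from collections import Counter


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two coders who labeled the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("both coders must label the same non-empty set of items")
    n = len(labels_a)
    # Observed agreement: share of items on which the two coders match.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement if each coder assigned categories independently
    # at their own marginal rates.
    freq_a = Counter(labels_a)
    freq_b = Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both coders used one identical category throughout
    return (observed - expected) / (1.0 - expected)


# Hypothetical valence codes for eight images (not taken from the study).
coder_1 = ["positive", "negative", "negative", "positive",
           "neutral", "negative", "positive", "neutral"]
coder_2 = ["positive", "negative", "positive", "positive",
           "neutral", "negative", "negative", "neutral"]
kappa = cohens_kappa(coder_1, coder_2)
print(f"kappa = {kappa:.3f}")
```

A value near 1 indicates agreement well beyond chance, while a value near 0 indicates agreement no better than chance, which is one way to flag coding dimensions (frame, appeal, valence) on which coders diverge.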

Media Coverage vs. Online Discussion: We aim to compare public opinion on specific
topics, such as gun control, as expressed in media coverage and in online discussion. The LexisNexis database
provides archived newspaper articles and news transcripts from major TV networks, which
serve as the media coverage component. For the online discussion component, we developed software to
crawl comments from the Facebook pages of mass media outlets (e.g., CNN News on Facebook) and
comments on the outlets' own websites. Given a date range, our software crawls
the online comments automatically. In total, we have collected over 1 million comments from
the Facebook pages of mass media outlets including CNN, ABC, CBC, NBC News, Fox, and WSJ. Some
content analysis software (e.g., LIWC) can be used to analyze the crawled data. To facilitate
the analysis, we also developed several text processing tools to handle the output of
LIWC. We also developed a web application (crawler) that fetches news from the NY Times. It was
developed using the API provided by the NY Times. Our crawler has a simple GUI that allows users
to search for news containing given keywords within a given period, and it saves the results in the
commonly used CSV format. A further study is an exploratory attempt to use automatic
linguistic  analysis  for  understanding  social  media  users'  news  commenting  behavior.  The  study 
addresses geographically based dynamics in human-computer interaction, namely users' ties
to a geographic community. Specifically, the study reveals that commenting behavior differs
between users with different levels of local community ties. Comments by local users, those with
stronger local community ties, exhibit different linguistic patterns from those of
national users, who are less involved in the local community. The linguistic differences are reflected
in  the  use  of  pronouns,  personal  pronouns,  social  words,  swear  words,  anxiety  words  and  anger 
words.  We  argue  that  identification  of  these  differences  is  crucial  in  the  practice  of  mining 
social  media  conversations  for  public  opinion. 
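The dictionary-based analysis described above can be illustrated with a small sketch. LIWC itself is proprietary, so the code assumes a tiny hypothetical lexicon and made-up comments. It only shows the kind of computation involved (per-comment category percentages averaged within a user group) and does not reproduce the study's tools, data, or findings.

```python
import re

# Hypothetical mini-lexicon; LIWC's real dictionaries are far larger.
LEXICON = {
    "pronoun": {"i", "we", "you", "they", "it", "our", "us"},
    "anger": {"hate", "angry", "outrage"},
    "anxiety": {"worried", "afraid", "fear"},
}


def category_rates(comment):
    """Percentage of words in `comment` that fall in each category."""
    words = re.findall(r"[a-z']+", comment.lower())
    if not words:
        return {name: 0.0 for name in LEXICON}
    return {
        name: 100.0 * sum(w in vocab for w in words) / len(words)
        for name, vocab in LEXICON.items()
    }


def mean_rates(comments):
    """Average the per-comment category rates over a group of comments."""
    if not comments:
        raise ValueError("need at least one comment")
    rates = [category_rates(c) for c in comments]
    return {name: sum(r[name] for r in rates) / len(rates) for name in LEXICON}


# Made-up comments standing in for the two user groups.
local_comments = [
    "We are worried about our water.",
    "They drilled near us and I am afraid.",
]
national_comments = [
    "Fracking policy needs more study.",
    "It is an energy question.",
]
local_mean = mean_rates(local_comments)
national_mean = mean_rates(national_comments)
for name in LEXICON:
    print(f"{name}: local {local_mean[name]:.1f}%, national {national_mean[name]:.1f}%")
```

Averaging per-comment percentages, rather than pooling all words, keeps long comments from dominating a group's profile; whether the original analysis made the same choice is not stated in the report.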
Image Retweeting: To study the image retweeting problem, we developed a web application
that fetches tweets matching given keywords from Twitter within a given time period. It was developed
on top of the Twitter API. To facilitate our analysis, we applied machine learning and computer vision
algorithms to the crawled images. One interesting application is human detection: given an input
image, we need to determine whether it contains a human body. In particular, we applied
a part-based human detection algorithm to the collected image data set. Some images
in this data set have been manually labeled by our collaborators, which enables
us to evaluate the performance of the machine learning algorithms. Subspace segmentation is one
of the most active topics in computer vision and machine learning. Data (e.g.,
images of human faces) are generally assumed to lie in a union of multiple linear subspaces, so the key
is to find a block-diagonal affinity matrix, which allows the data to be segmented into the correct
clusters. Recently, segmentation methods based on graph construction have attracted a lot of
attention. Following this line, we propose a novel approach to construct a Sparse Graph with
Block-wise constraint for face representation, named SGB (Fig. 1). Inspired by recent work
on least squares regression coefficients, SGB first generates a compact block-diagonal
coefficient matrix. Meanwhile, a graph regularizer introduces a sparse graph, which focuses on the
local structure and benefits the segmentation of multiple subspaces. Different graph
regularizers suit different data: for balanced data, a b-matching constraint yields a more balanced graph;
for unbalanced data, a k-nearest neighbor regularizer preserves more manifold information.
To solve our model, we devise a joint optimization strategy that learns the
block-wise and sparse graph simultaneously. To demonstrate the effectiveness of our method,
we consider two application scenarios: face clustering and kinship verification. Extensive
results on Extended YaleB and ORL demonstrate that our graph consistently outperforms
several state-of-the-art graphs. Our algorithm achieves an accuracy of more than 98% (see
following table). We also developed algorithms to automatically analyze facial expressions,
age, and emotions.

Fig. 1. Framework of our proposed method. Inputs are various facial
images taken under different conditions (illumination, expression, gender,
etc.). Here we show face images from three different individuals. We
propose a novel approach to construct a graph (middle part). By virtue
of the block-diagonal property induced by the Frobenius norm (top-middle) and the sparse
property (bottom-middle), a block-wise and sparse graph can be achieved.
Specifically, in the sparsity part, we take both unbalanced data (left) and balanced
data (right) into consideration using the k-nearest neighbor and b-matching
methods, respectively. The final clustering result is shown in the right part.
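To make the block-diagonal idea concrete, the sketch below shows a simplified stand-in, not the SGB method itself. It assumes the standard closed-form least squares regression self-representation, Z = (X^T X + lam I)^{-1} X^T X, followed by a separate k-nearest-neighbor sparsification step, whereas SGB learns the block-wise and sparse graph jointly. All function names, the parameter values, and the synthetic two-subspace data are hypothetical, and NumPy is assumed to be available.

```python
import numpy as np


def lsr_coefficients(X, lam=0.1):
    """Closed-form least squares regression self-representation.

    X holds one sample per column. Minimizes
    ||X - X Z||_F^2 + lam * ||Z||_F^2, whose solution is
    Z = (X^T X + lam I)^{-1} X^T X.
    """
    gram = X.T @ X
    n = gram.shape[0]
    return np.linalg.solve(gram + lam * np.eye(n), gram)


def knn_sparsify(Z, k):
    """Symmetrize |Z| and keep the k largest off-diagonal weights per row."""
    W = (np.abs(Z) + np.abs(Z.T)) / 2.0
    np.fill_diagonal(W, 0.0)
    keep = np.zeros_like(W, dtype=bool)
    idx = np.argsort(-W, axis=1)[:, :k]
    rows = np.arange(W.shape[0])[:, None]
    keep[rows, idx] = True
    keep = keep | keep.T  # keep an edge if either endpoint selects it
    return np.where(keep, W, 0.0)


def make_two_subspaces(n_per=20, dim=30, rank=3, seed=0):
    """Synthetic samples from two orthogonal low-dimensional subspaces."""
    rng = np.random.default_rng(seed)
    Q, _ = np.linalg.qr(rng.standard_normal((dim, 2 * rank)))
    A = Q[:, :rank] @ rng.standard_normal((rank, n_per))
    B = Q[:, rank:] @ rng.standard_normal((rank, n_per))
    return np.hstack([A, B])


X = make_two_subspaces()
Z = lsr_coefficients(X, lam=0.1)
W = knn_sparsify(Z, k=5)

# With samples ordered by subspace, cross-subspace edges sit in the
# off-diagonal blocks of W.
cross_block_weight = W[:20, 20:].sum()
within_block_weight = W[:20, :20].sum() + W[20:, 20:].sum()
print(f"within-block weight: {within_block_weight:.3f}")
print(f"cross-block weight:  {cross_block_weight:.3e}")
```

On this idealized data the graph has essentially no cross-subspace weight, so a spectral clustering step on W would separate the two groups; real face images only approximately satisfy the subspace assumption, which is what motivates the regularizers described above.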

Average clustering accuracy (%) of different methods on the ORL and Extended YaleB datasets (5 and 10 subjects).
Abstract

The  project  team  worked  on  several  fronts  to  tackle  problems  related  to  evaluating  and  validating  the  machine  learning 
techniques  that  were  being  developed  to  classify  content  shared  via  social  media.  We  began  by  focusing  on  text-based 
content shared via social media. Here we collected data in the form of social media users' comments on news articles about
hydrofracking (or fracking). Next, we expanded our efforts to address image-based content shared via social media. We
opted to collect all images shared via Twitter associated with the active conversation on the topic of gun control (#guncontrol).
We simultaneously developed a human image coding protocol so that shared images could be classified along dimensions
including frame, appeal, and valence. Next, we trained human coders and classified all images. We were able to identify
which image attributes predicted diffusion and propagation across Twitter. However, one problem was the difficulty of
obtaining consistent results from human coders, because human coders vary in terms of their positions on political and social
issues like gun control. Results indicate that shared images with attribute frames, fear and humor appeals, and positive
valence  are  retweeted  more  often.  Also  retweeted  more  frequently  are  messages  from  users  with  larger  networks  and  whose 
tweets  contain  hashtags.  Results  also  show  a  significant  negative  relationship  between  the  time  since  the  last  major  shooting 
event  in  the  United  States  and  the  likelihood  that  messages  with  images  are  retweeted.  These  results  are  meaningful 
considering  the  context  of  evolving  mass  media  systems  and  online  social  networks. 